GENETIC COMPOSITIONS AND METHODS

CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation-in-part of USSN 08/813,159, filed March
7, 1997 and USSN 60/042,125 filed March 28, 1997, which are incorporated by reference 
in their entirety for all purposes. 

BACKGROUND OF THE INVENTION 
The genomes of all organisms undergo spontaneous mutation in the course of 
their continuing evolution generating variant forms of progenitor sequences (Gusella, Ann. 
Rev. Biochem. 55, 831-854 (1986)). The variant form may confer an evolutionary advantage 
or disadvantage relative to a progenitor form or may be neutral. In some instances, a variant 
form confers a lethal disadvantage and is not transmitted to subsequent generations of the 
organism. In other instances, a variant form confers an evolutionary advantage to the species 
and is eventually incorporated into the DNA of many or most members of the species and 
effectively becomes the progenitor form. In many instances, both progenitor and variant 
form(s) survive and co-exist in a species population. The coexistence of multiple forms of 
a sequence gives rise to polymorphisms. 

Several different types of polymorphism have been reported. A restriction 
fragment length polymorphism (RFLP) means a variation in DNA sequence that alters the 
length of a restriction fragment as described in Botstein et al., Am. J. Hum. Genet. 32,
314-331 (1980). The restriction fragment length polymorphism may create or delete a restriction
site, thus changing the length of the restriction fragment. RFLPs have been widely used in
human and animal genetic analyses (see WO 90/13668; WO 90/11369; Donis-Keller, Cell 51,
319-337 (1987); Lander et al., Genetics 121, 85-99 (1989)). When a heritable trait can be
linked to a particular RFLP, the presence of the RFLP in an individual can be used to predict
the likelihood that the individual will also exhibit the trait.

Other polymorphisms take the form of short tandem repeats (STRs) that include 
tandem di-, tri- and tetra-nucleotide repeated motifs. These tandem repeats are also referred 
to as variable number tandem repeat (VNTR) polymorphisms. VNTRs have been used in 
identity and paternity analysis (US 5,075,217; Armour et al., FEBS Lett. 307, 113-115
(1992); Horn et al., WO 91/14003; Jeffreys, EP 370,719), and in a large number of genetic
mapping studies. 

Other polymorphisms take the form of single nucleotide variations between 
individuals of the same species. Such polymorphisms are far more frequent than RFLPs, 
STRs and VNTRs. Some single nucleotide polymorphisms occur in protein-coding sequences,
in which case one of the polymorphic forms may give rise to the expression of a defective
or other variant protein and, potentially, a genetic disease. Examples of genes in which
polymorphisms within coding sequences give rise to genetic disease include β-globin (sickle
cell anemia) and CFTR (cystic fibrosis). Other single nucleotide polymorphisms occur in
noncoding regions. Some of these polymorphisms may also result in defective protein 
expression (e.g., as a result of defective splicing). Other single nucleotide polymorphisms 
have no phenotypic effects. 

Single nucleotide polymorphisms can be used in the same manner as RFLPs and
VNTRs, but offer several advantages. Single nucleotide polymorphisms occur with greater
frequency and are spaced more uniformly throughout the genome than other forms of 
polymorphism. The greater frequency and uniformity of single nucleotide polymorphisms 
means that there is a greater probability that such a polymorphism will be found in close 
proximity to a genetic locus of interest than would be the case for other polymorphisms. 
Also, the different forms of characterized single nucleotide polymorphisms are often easier
to distinguish than other types of polymorphism (e.g., by use of assays employing
allele-specific hybridization probes or primers).

Despite the increased amount of nucleotide sequence data being generated in 
recent years, only a minute proportion of the total repository of polymorphisms in humans 
and other organisms has so far been identified. The paucity of polymorphisms hitherto 
identified is due to the large amount of work required for their detection by conventional 
methods. For example, a conventional approach to identifying polymorphisms might be to 
sequence the same stretch of oligonucleotides in a population of individuals by dideoxy
sequencing. In this type of approach, the amount of work increases in proportion to both the 
length of sequence and the number of individuals in a population and becomes impractical for 
large stretches of DNA or large numbers of persons. 

SUMMARY OF THE INVENTION

The invention provides nucleic acid segments of between 10 and 100 bases from 
a fragment shown in Table 1, column 1 including a polymorphic site. Complements of these 
segments are also included. The segments can be DNA or RNA, and can be double- or
single-stranded. Some segments are 10-20 or 10-50 bases long. Preferred segments include
a diallelic polymorphic site. The base occupying the polymorphic site in the segments can 
be the reference (Table 1, column 3) or an alternative base (Table 1, column 5). 

The invention further provides allele-specific oligonucleotides that hybridize to
a segment of a fragment shown in Table 1, column 8 or its complement. These
oligonucleotides can be probes or primers. Also provided are isolated nucleic acids
comprising a sequence of Table 1, column 8, or the complement thereto, in which the 
polymorphic site within the sequence is occupied by a base other than the reference base 
shown in Table 1, column 3. 

The invention further provides a method of analyzing a nucleic acid from an
individual. The method determines which base is present at any one of the polymorphic sites
shown in Table 1. Optionally, a set of bases occupying a set of the polymorphic sites shown
in Table 1 is determined. This type of analysis can be performed on a plurality of individuals
who are tested for the presence of a disease phenotype. The presence or absence of disease
phenotype can then be correlated with a base or set of bases present at the polymorphic sites
in the individuals tested.



DEFINITIONS 

An oligonucleotide can be DNA or RNA, and single- or double-stranded. 
Oligonucleotides can be naturally occurring or synthetic, but are typically prepared by 
synthetic means. Preferred oligonucleotides of the invention include segments of DNA, or
their complements, including any one of the polymorphic sites shown in Table 1. The
segments are usually between 5 and 100 bases, and often between 5-10, 5-20, 10-20, 10-50, 
20-50 or 20-100 bases. The polymorphic site can occur within any position of the segment. 
The segments can be from any of the allelic forms of DNA shown in Table 1. 

Hybridization probes are oligonucleotides capable of binding in a base-specific
manner to a complementary strand of nucleic acid. Such probes include peptide nucleic acids,
as described in Nielsen et al., Science 254, 1497-1500 (1991).

The term primer refers to a single-stranded oligonucleotide capable of acting as
a point of initiation of template-directed DNA synthesis under appropriate conditions (i.e.,
in the presence of four different nucleoside triphosphates and an agent for polymerization,
such as DNA or RNA polymerase or reverse transcriptase) in an appropriate buffer and at
a suitable temperature. The appropriate length of a primer depends on the intended use of the
primer but typically ranges from 15 to 30 nucleotides. Short primer molecules generally
require cooler temperatures to form sufficiently stable hybrid complexes with the template.
A primer need not reflect the exact sequence of the template but must be sufficiently
complementary to hybridize with a template. The term primer site refers to the area of the
target DNA to which a primer hybridizes. The term primer pair means a set of primers
including a 5' upstream primer that hybridizes with the 5' end of the DNA sequence to be
amplified and a 3' downstream primer that hybridizes with the complement of the 3' end of
the sequence to be amplified.

Linkage describes the tendency of genes, alleles, loci or genetic markers to be
inherited together as a result of their location on the same chromosome, and can be measured
by percent recombination between the two genes, alleles, loci or genetic markers.

Polymorphism refers to the occurrence of two or more genetically determined
alternative sequences or alleles in a population. A polymorphic marker or site is the locus
at which divergence occurs. Preferred markers have at least two alleles, each occurring at
a frequency of greater than 1%, and more preferably greater than 10% or 20%, of a selected
population. A polymorphic locus may be as small as one base pair. Polymorphic markers
include restriction fragment length polymorphisms, variable number of tandem repeats
(VNTRs), hypervariable regions, minisatellites, dinucleotide repeats, trinucleotide repeats,
tetranucleotide repeats, simple sequence repeats, and insertion elements such as Alu. The first
identified allelic form is arbitrarily designated as the reference form and other allelic forms
are designated as alternative or variant alleles. The allelic form occurring most frequently in
a selected population is sometimes referred to as the wildtype form. Diploid organisms may
be homozygous or heterozygous for allelic forms. A diallelic polymorphism has two forms.
A triallelic polymorphism has three forms.
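
As an illustration of these definitions, the following sketch checks whether a site meets the preferred-marker frequency criterion and estimates the fraction of heterozygous individuals. The heterozygosity formula assumes Hardy-Weinberg proportions (random mating), which the text does not itself state; the function names and the example frequencies are hypothetical.

```python
def is_preferred_marker(allele_freqs, threshold=0.01):
    """Return True if at least two alleles each occur above the threshold frequency."""
    common = [f for f in allele_freqs if f > threshold]
    return len(common) >= 2


def expected_heterozygosity(allele_freqs):
    """Expected heterozygote fraction, 1 - sum(p_i^2).

    Assumption: genotypes are in Hardy-Weinberg proportions.
    For a diallelic site this reduces to 2*p*q.
    """
    return 1.0 - sum(f * f for f in allele_freqs)


if __name__ == "__main__":
    freqs = [0.75, 0.25]  # hypothetical diallelic site
    print(is_preferred_marker(freqs))
    print(round(expected_heterozygosity(freqs), 3))
```

The same helper accepts three frequencies for a triallelic site, since it sums over however many alleles are supplied.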

A single nucleotide polymorphism occurs at a polymorphic site occupied by a 
single nucleotide, which is the site of variation between allelic sequences. The site is usually 
preceded by and followed by highly conserved sequences of the allele (e.g., sequences that
vary in less than 1/100 or 1/1000 members of the populations). 

A single nucleotide polymorphism usually arises due to substitution of one 
nucleotide for another at the polymorphic site. A transition is the replacement of one purine
by another purine or one pyrimidine by another pyrimidine. A transversion is the replacement
of a purine by a pyrimidine or vice versa. Single nucleotide polymorphisms can also arise 
from a deletion of a nucleotide or an insertion of a nucleotide relative to a reference allele. 
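
The distinction between the two substitution classes can be sketched in a few lines. The function name below is hypothetical; the purine and pyrimidine groupings are the standard ones.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as a 'transition' or a 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a nucleotide: {base!r}")
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(ref, alt, classify_substitution(ref, alt))
```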

Hybridizations are usually performed under stringent conditions, for example, at 
a salt concentration of no more than 1 M and a temperature of at least 25 °C. For example,
conditions of 5X SSPE (750 mM NaCl, 50 mM NaPhosphate, 5 mM EDTA, pH 7.4) and a
temperature of 25-30°C are suitable for allele-specific probe hybridizations. 
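
The quoted 5X composition implies the 1X and other working strengths by simple proportion. A minimal sketch of that arithmetic follows; the dictionary and function names are hypothetical, and only the 5X values come from the text.

```python
# Composition of 5X SSPE as quoted in the text, in millimolar.
SSPE_5X_MM = {"NaCl": 750.0, "NaPhosphate": 50.0, "EDTA": 5.0}


def sspe_concentrations(fold):
    """Scale the quoted 5X composition to another fold strength (values in mM)."""
    if fold <= 0:
        raise ValueError("fold strength must be positive")
    return {name: conc * fold / 5.0 for name, conc in SSPE_5X_MM.items()}


if __name__ == "__main__":
    for fold in (1, 5, 6):
        print(fold, sspe_concentrations(fold))
```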

An isolated nucleic acid means an object species that is the predominant
species present (i.e., on a molar basis it is more abundant than any other individual species
in the composition). Preferably, an isolated nucleic acid comprises at least about 50, 80 or
90 percent (on a molar basis) of all macromolecular species present. Most preferably, the
object species is purified to essential homogeneity (contaminant species cannot be detected in
the composition by conventional detection methods).

Linkage disequilibrium or allelic association means the preferential association of a
particular allele or genetic marker with a specific allele or genetic marker at a nearby
chromosomal location more frequently than expected by chance for any particular allele
frequency in the population. For example, if locus X has alleles a and b, which occur equally
frequently, and linked locus Y has alleles c and d, which occur equally frequently, one would
expect the combination ac to occur with a frequency of 0.25. If ac occurs more frequently,
then alleles a and c are in linkage disequilibrium. Linkage disequilibrium may result from
natural selection of certain combinations of alleles or because an allele has been introduced into
a population too recently to have reached equilibrium with linked alleles.
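
The worked example above can be expressed numerically. The sketch below computes the difference between an observed haplotype frequency and the product of the allele frequencies (commonly called the coefficient D, a standard measure that the text does not name); the observed frequency of 0.40 is a made-up value for illustration.

```python
def linkage_disequilibrium(p_a, p_c, p_ac):
    """Return D = observed frequency of haplotype ac minus its chance expectation.

    p_a  : frequency of allele a at locus X
    p_c  : frequency of allele c at locus Y
    p_ac : observed frequency of the ac combination
    """
    for value in (p_a, p_c, p_ac):
        if not 0.0 <= value <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
    expected = p_a * p_c  # 0.25 when both alleles have frequency 0.5
    return p_ac - expected


if __name__ == "__main__":
    d = linkage_disequilibrium(0.5, 0.5, 0.40)  # 0.40 is hypothetical
    print("in disequilibrium" if d > 0 else "no excess association", round(d, 3))
```

A value of zero corresponds to the chance expectation of 0.25 in the example; a positive value indicates that a and c occur together more often than expected.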

A marker in linkage disequilibrium can be particularly useful in detecting susceptibility
to disease (or other phenotype) notwithstanding that the marker does not cause the disease.
For example, a marker (X) that is not itself a causative element of a disease, but which is in
linkage disequilibrium with a gene (including regulatory sequences) (Y) that is a causative
element of a phenotype, can be detected to indicate susceptibility to the disease in
circumstances in which the gene Y may not have been identified or may not be readily
detectable.

WO 98/38846 PCT/US98/04571

The present invention includes the use of any of the polymorphic forms shown in Table 
1 as a means to determine susceptibility to a phenotype resulting from an allele or marker in 
linkage disequilibrium with such polymorphic forms. 

DESCRIPTION OF THE PRESENT INVENTION
I. Novel Polymorphisms of the Invention

The novel polymorphisms of the invention are listed in Table 1. The first column 
of the Table lists the names assigned to the fragments in which the polymorphisms occur. 
The fragments are all human genomic fragments. SGC, TIGR and WI respectively stand for
Stanford Genome Center, The Institute for Genomic Research and the Whitehead Institute.
The sequence of one allelic form of each of the fragments (arbitrarily referred to as the
prototypical or reference form) has previously been determined. Many of these
sequences are listed at http://www-genome.wi.mit.edu/; http://shgc.stanford.edu; or
http://www.tigr.org/. The Web sites also list primers for amplification of the fragments, and
the genomic location of fragments. Some fragments are expressed sequence tags, and some 
are random genomic fragments. All information in the websites concerning the fragments 
listed in Table 1 is incorporated by reference in its entirety for all purposes. 

The second column lists the position in the fragment in which a polymorphic site has 
been found. Positions are numbered consecutively with the first base of the fragment 
sequence as listed in one of the above databases being assigned the number one. The third 
column lists the base occupying the polymorphic site in the sequence in the database. This
base is arbitrarily designated the reference or prototypical form but is not necessarily the most
frequently occurring form. The fifth column in the table lists the alternative base(s) at the
polymorphic site. The eighth column of the Table lists about 15 bases of sequence on either
side of the polymorphic site in each fragment. The indicated sequences can be either DNA
or RNA. In the latter, the T's shown in the Table are replaced by U's. The base occupying
the polymorphic site is indicated in IUPAC-IUB ambiguity code. The fourth and sixth
columns of the table show the frequency with which reference and alternative alleles occur 
at a polymorphic site. The seventh column in the table indicates the population frequency of 
heterozygotes of the polymorphic site. 
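
The ambiguity-code convention described above can be sketched as follows. The code expands a sequence tag into its two allele-specific sequences using the standard two-base IUPAC symbols. The function names are hypothetical, the example tag is one of the entries listed in the table, and the code does not determine which of the two bases is the reference form (that is given by columns 3 and 5). The position returned is counted within the tag, not within the fragment.

```python
# Standard IUPAC symbols that stand for exactly two bases.
IUPAC_DIALLELIC = {
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def expand_tag(tag):
    """Return (1-based site position within the tag, the two allele sequences)."""
    tag = tag.upper()
    sites = [i for i, base in enumerate(tag) if base in IUPAC_DIALLELIC]
    if len(sites) != 1:
        raise ValueError("expected exactly one two-base ambiguity symbol")
    i = sites[0]
    alleles = tuple(tag[:i] + base + tag[i + 1:] for base in IUPAC_DIALLELIC[tag[i]])
    return i + 1, alleles


def to_rna(seq):
    """RNA form of a listed sequence: each T is replaced by U."""
    return seq.upper().replace("T", "U")


if __name__ == "__main__":
    position, alleles = expand_tag("AGAGCCGTCTYCTCAGGTTGC")
    print(position)
    for allele in alleles:
        print(allele, to_rna(allele))
```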

Sequence tag


TTAAGTGAQAMTGTTTTAAAC 


AGAGCCGTCTYCTCAGGTTGC 


GTTGCCTGTCSTCTCCTGGCC 


GGCCGCATCCSTTAGTTTCCA 


AGAGAAAAAAYCAACAGCAAA 


CAGCAAACAAMACCACACAAA 


CAATAAGCACKCATGACCTCA 


GTGATTTGGTMAGCATATCTT 


TGTACTTTGGRCTCCAGACTT 


AGTAGAAAAGSCTTCTAGGTT 


CCCCCGCCTAMCTGGAGATGT 


ACGCCACAGARTCCTCCAATT 


ATAGAGAAATRAAAACCCAAT 


CCCAATTTCTYTTTCACCATT 


TATTTTTTGTRTGACTCCTAT


TTTAGACAGGSAGCAGAAGCA 


ACCAGACAAGRGATGTAGATT 


ATCAAAGCACWATCTGTGTTT 


GATGCCAGCARCACAACACCC 


TGGGAAGAGTMTGTGACTTTA 


TAGAAAGTAAYTGCATTTCAG 


CTCCCCACCCWAAAATAACGT 


TCCCCACCCAWAAATAACGTA 


TACCTATGTCRTGCCATGTAG 


GAAACATACARTGTAATAGAA 


ATTTTATTTGMGCCCTAGGAG 


GTAGGTCCTGSTCTCCTATCA


TTTCTTTTTGKCTCTTAGAAT


ATGGAGGGGGSTGCAGGTTGG 


GACCCCATTGWCTTACGCAAA 


TAAAAAAGCCWAAAGACAGCC 


TGGATAGGTCRACCGGCTGAA 


ACAAAAGGACSAAAAACACTC 


CTCCCTTTCTKCCTGGCCCTT 


AGGACACTCARTTCACATGCC 


AAACCATGAAYGGTATAAGGA 
CCCAATTAGARCCATGTCATT


AGCGGATTATRTCTGACGCCA 


CTTAGACTGARATTCATAAAG 


AQGGATGACAMAAATCACTAA 


ATTCCTAAAAMAAAGAAAAGT 


TGCTTGATTTRGGAGATAAAA 


CTCCATCCTASGATTCTGCCT


TTAGTTTTGTWTTACTAAAAC 


CCAGGAATCGRCAATGCTAAT 


GGCCTCCCCTRCCCTGATCAT 


CCTTAGTTTCMTAAAAGCCCC 
AAACATGTGCGAAAARAAGTGTGGGAATCAC


AAGGATTAAGTTTAARCCACACTACCAAAAG


TTAACCCAGAGTCGCMTCTCTTCAAAATGCA


TCTGTGGTCCCTTTAYAAAGCCTCTTGCATC


AGAAGCCAGTCATACKTGCTTTAAAATTGAC


TTCTTCCCAGGTTCTKGTGGTGGCTGTCAAT


AGAAAAATCCAAGAGYCTTAAACCATATTTT


TTAGTTGATGAATTTRAATTTTACAQTATCT 


TTTATGCTGCAGTCGRAATACTTGGAGCCTG 


CTTGTTAAAGTCCCAYCAAAGAAAGGATCCC 


TGAAAAAAGGGAAAAYACCCATGTTTGCTAA 


CAGCCTTTTTAGAGTYCCTGGGCAATTTGTG 


ATAAGGAGGTGGGGAYGACACATTACTCTCC 


TAAATCATTCTAACAWCACAAATATCTTATT 


AGCATCGTGTCATTCWCAGTGTTTTAGGTTT 


TTTATCCGCAATAAAMTTCCCAAAGTCCTCG 


CTCAGTTTTTCCATCWTTTTTTCATAATTTA 


TATTTGGATAAGTTTSACAAAGATGAGAACA 


GTCCTAGAACCTCAGRATCGAAAGGAAGTTC 


CCTTGTTTTCTTTTGSATTGAAAAATACTGG 


ACATGATTCAATGATYCCATTTTGAAAATTA 


GGCTTCCTCTATGCAYGCGTCTATCTTCTAT 


TCCAATGTCGCATTCRTTTTGCCATTTCCTG 


TACAGAAAAAAAATTKTACATATCAAATGAC 


ACCATGGGAATCTTGRTGCAAGTTAGATCCC 


CAAAGGTCACAGGCARCGTACATACGGTTCT 
CAAATCAATTACAACWATGTGCTTATCAGCT


ACCCCTATATTTTAAWGCAACTGACAGTTTT


CTATATCTT6TCACAKAGAAGTACCACACAT


TTCTATAAAACAACAYAAGGAACGAGGCTCA 


TAGAGACtGAAGCTGKTATCAACCTTCCGTA 


TTTATTAAGQACATTSTGTAATGTTTCCACT 


CCACTTTGTTTTAAAYAATTACAAACATGTG 


ATAAAAGTTGTCATAYAGCAATGGATGCTGT 


CCAAAAACAAAGAATRAACATTGGAATAGTC 


CATTATTAAGGAGAGYACTAGGAAAAACTAC 


CTCTGGAGCCACAGCYGGCTAATACACTGCA 


ACAGCTGCAGAATGGMCTTCTTCCTTCCCAG 


CCCCAAAACATCACARAATTATTCATACTAT 


ATGCAGTTAAAATTCYAGAATAATTAAAAGC 


AGGCACCCAGCCATCSTGACCCAGCGAGGAG 


TGAGAGAGGAGCCACRGTCCCTAATGACACC 


agcttaactgacagaygttaaAgctttctgg 


TTGAAGAATATATTGWCAGAAACACAAGGCT 


TGTAAACAATTGTTAWGTGTTTAGAATCAGA 


GCTGGGCTGTGTTCCYCGGGCTCTTCTGGAC 


GAGAGGAAAGAAAAAWACAACTTTCATTCTT 


GGCTATGAAATAGTCYATTCAGTGAACTAGT 


CAGTCTTTGTCCTGGRAATATCTCACAAAAT 
TCAATGCAACAAGTARAATTTQTAAACTCAA


AATAGGAAACCAGAGRGGGAGCCCCAGGTGG


CTTAACCTTTGGCCTRCCTGCCTGGCTGTTT


CGGGCATTGAGGATAYATGGAAGGCTCAGGA 


TGAGGAAGACAGTCAYGGTCGAACAAACAAC 


AGTCATGGTCGAACARACAACATGCTTCGGA 
AAAGACACCATTTATMTACCCAAGGGCAGAA


AAACATAATTGATTCRTATCTGCGAGACTTA 


TTTGCTCTAAAAGAARAAGGAACTAGGTCAA 


TAAGCATTGCCTGGCYTTCCTGTCTAGTCTC 


TAGAGATAATAATCARTTCTTTACAACCGAT 


CCATTCTCCTAtTTAYCAGTCCTGTCCTATA 


CCACTTCTCCCCGCARACCTAGGTCAGACTT 


gtctgccttaaagcartacccccctaccaca 


GGTCCCCCAGATTGASGTCTGAGTGTGGGCA 


GACTTCACTTTGGTGYCAATGGACAGAAAAT 


CTTGCTGGCTACTGGRTGTTAGTTTGCAGTC


ATGATCACCGACTGARAATATTGTTTTACAA


CAACATCCTCTGCCAYACACAACAAAACGTA


QTTATTATGCTCTTARTGATTTACAGACTGA


TCTGCTTTAACTTGGYATTCCTCTAATTGTG


CTGCTATTCCCAGATSAAGATTTGGTGGAAG 


GCCCAGCTACAGCCTYGGTGCATCTTAACCC 


AAAATACCCTTCTCTRATAATTTAAGTAACC 


CTTCTCTAATAATTTRAGTAACCAAAATATT 


TATGTAGCAAATCTAWTCCCCTAAGCACAGT 


GTATTAAATAAATTAYGTTAACTGGCTCTGA 


AAATCATGACTTTTTWAAAAATACCAGACTA 


CAGGATCAGGGAAGGMATTATAATAAATATA 


TGATTGTTTTACATGYGAAATCTGGCTTCAG 


GTCCCCAAACTCTTAYTTAATTCCATTCAAT 


ACCTCTATTCTCTTAYTAAACTTTTGGATAC 


CAACCAGGTCTTGTTYCTACCCCTCTTAGAG 


CAGGTATGACTCCCARTCAACTTCTTGACTC 


TTACCCTTTGTCATTSTCAGACCAAGTACAT 


AAACTCTGCGGTGTGRAGAAAGGACAGTTAT 


ATTTATCTAGCCTGTWCAAGTCATCCAGTGA 


TTTATCTAGCCTGTAYAAGTCATCCAGTGAG 


TAATAACGTGTTGCAYACCTCACCAGAACTG 


GGGGGAGTtCAGACAMAGCCAAGAAAAGCCT 


TTTATATCCATCTTCYATTTTAATTTTCTAC 


AAATATTATTCTTTTYTCATATTTTCCAATT 


TTTCCAATTATTAATMCTAGAATTTTCACCA 


GTCTTCTAATAGCAAMAGCTACTGGAAGCGG 


TGCCCCTGTCCAAGGYTGTGTCTACACATGA 
TAATTCATTACACtCYACATCATATTTTCTT


GAGGAACATTTAGAQRQTCCATCTCTQATGT 


ACACTGCTCTAGACCYTCCCAGGGTCCCTCA 


TCATGGGCAGGAATTYCATTTCTGTGTTTCT 


CAGAATTACTTGGCAYAGGGTTTCTTAAAAC


ATCTGCAGGCTCTCCSTTTCTAAGTCACCTG 


CAAAAGTGTGTTAATYCTTAATACCAATTTT


TACGCTTTTAAAAAAWAATAAAAATACTGTA 


TGTTTCAACTAAGGAYAGACTTCAGAAGGCA 


TCAGCCAGCTATCTTKGGTGCAGAGAGGTAC 


AGGTACTCCAAGTACYGTGGGGGTTCTGATG 


AAGGGGGAGCAGGCAYGTCACATACCCAGAG 


GAGAGAGAAAGAGAGRAAGTGCCACACATTT


CTCACCTAAATTATGMGTGATTAAAATATAC 


GCTTTAAGTACTTTASGAAGACCTTGACTGT 


ATGACCAAAATGAGAYAAATTTGTTAAAAAA 


AAAAAATTTAAGCCTRAAGTAGTGCTTTTTA


AAAAAAGAGCAGACAKTTTATCATGTGTTCT 


II ICIGCICAAAGAUWI I I 1 1 1 1 A AG I IAIC 


TGAAAAGAAAAACTTWCACCTTTTATTTTAA


AAAATATAATTTGCTRTAGAGTTCACAGATG


CACAGCATCACACCAYAGGGCCCACGGGAGG 


AATAAATTTTTTTAARAAGGTTTAGCTATTC 


AAATCATGTGCCCCASAGAGCCCCAAAGCTT 


GCACATAGTGGAAAGYGCTAAGTGTCCTACG 


GTCAGATCATATCCAYAGAAAAACAGCTCTC 


GAGATTCTGATTCAGYGTGCTCAGGCGGGGC


TAAAAGTCTCTTCAGYAGGAAAAAAGCTACA


CACGTAACTAAGTTCMTATAATTTTAACTTG


AACTTTAATAAATACKCTTTTTACAAAACAC


AAATAACCACAGCAGYTTTCAGTATAATTTG


GTACAATTTATTTGCYGGCTGGAATTTGTTC


AGCCTCAGTCTTCACMCTCCTCCCTCCTCCA


TGTAAAACAGACAAAMTGCATTACAACTGTG 



AAACAACTAtCAACASCTGCAACACAAACCA 


TTTATTTATCAAACTSCAATTCCATTTCACA 


TGTGGTTTTCGCCTGRTAGACCACAGGGCGA 


CCI III III ICCCCCYUIUAI I (J 1 IAAI IAU 


TTACCAAACCTCTGTRGCTTAGCCTCGCCTA 


AGAGTGGGCAGTTCAKGTTTTATTAGTATAT 


GTATTTAGTATACAGMAGTGATTTTCTCTCT 


AGAAAGAATCTGAATRTGAGGGAACTGCAGA


TGTTGGGTGGTCAAGRCTATTCAGAAAATCT 


CTTTGTCCTGGAGACMCCAGCTAGTCTAAGA 


CTCTGGTTTATTTAAKATCAACATTCACCAC 


GAATCCAGGACACAASAAGAAAAACACCCAA 


ATGGAGAGAGAAGACRAGACACAACTCCTCC 


CAACTCCTCCCCCACYGCCTCCGTGCTCTAG 


AGCCAGCTCTGACTTWCTCTCTGTTTCTGTC 


GAATACATGACCATTYCTCTTTTAGCACGTT 


GGGCACGGGGGAGGCRGAAGGAAGAGAAAGA 


GGAAAACTTGGATTTYCCAAGACCCGAAGAC 


TTAAACTCAAATATCYGAAATACTTTCATTA 


ACACCGTGCAAATGCYAAAGTGCACTGAGGA 


TAM I I C1 11 IGCTTS 1 1111 ICTTTCACCT 


TACAAAAAATCCTGCYCTTATAGAGCATACA 


GTACGGTGGAGGTCARGCATCTACAGGGTCA 


CTGATGACCTGCATGYGCCAGGTATGTGGTC 


AAACAACTATTGCATRGGAAAACATATGCAA 


AAAAAGAGTAAAAATKACCAAAAAATTAAAG 


Reference allele   Alternative allele   Heterozygote
frequency          frequency            frequency
0.50               0.50                 0.50
0.75               0.25                 0.38
0.63               0.38                 0.47
0.38               0.63                 0.47
0.50               0.50                 0.50
0.88               0.13                 0.22
0.06               0.94                 0.12
0.81               0.19                 0.30
0.88               0.13                 0.22
0.81               0.19                 0.30
0.69               0.31                 0.43
0.38               0.63                 0.47
0.94               0.06                 0.12
0.56               0.44                 0.49
0.13               0.88                 0.22
0.13               0.88                 0.22
0.56               0.44                 0.49
0.56               0.44                 0.49
0.94               0.06                 0.12
0.50               0.50                 0.50
0.81               0.19                 0.30
0.31               0.69                 0.43
0.50               0.50                 0.50
0.81               0.19                 0.30
0.88               0.13                 0.22
0.50               0.50                 0.50
0.69               0.31                 0.43
0.88               0.13                 0.22
0.75               0.25                 0.38
0.44               0.56                 0.49



WI-11710
WI-11715
WI-11715
WI-11727
WI-11728
WI-11758
WI-11773
WI-11790
WI-11806
WI-11879
WI-11906
WI-11909
WI-11946
WI-11965
WI-12002
WI-12002
WI-12002
WI-12018
WI-12020
WI-12075
WI-12086
WI-12108
WI-12159
WI-12169
WI-12173
WI-12179
WI-12201
WI-12210
WI-12229
WI-12234






TAATTTTAAAAAGCTRTTTAGGACCCAAACA 


GTTCTGCTCATAATTYCCAATATGTACCAGA 


GTACCTATGAAATAARACAGGTAGGGAATAT 


TCAAAAGCAATTCACRCTTCCAGAATACAAA 


CAATATAATTCCATTYCGAGTGATTAAAACC 


CAGGAAAAAGAGGAAMCCTGAACCCCTCTGC


CAGCATATGTATTATYTGAACTAAATTTACA 


TATATTGTATTTCTAYTTGACAGCACAGTTC 


TTGAGGTGTAGATATWGTTCCTCTCTTCTCG 


TGAACATTTAAATGTYATCCATGTGAGGGCT 


AGGGCTCTAGATCATKGTAGGTGATTGATAC 


GGGCTCTAGATCATGKTAGGTGATTGATACA 


GTAAAGGAATGGGAAYGTGTTGGTGGTCGCT 


TATTCTTGCTTTGATYGTCTACGTAAGCATQ 


TGTCTAGCAGTATTAYGCTATTAGCTATGTT 


TGGCATTAAGGATGCRGTAGGATGTCCACTT 


TGTAAACAGCTGTGCKCCATTTAGGCTTTGT 


TCAAGGTAAAGTCCARTACAAAAAAACAGCA 


GTGCTCTCAGTACAAMAAACAGCATCAGTAG 


AACCCTGAGACTTTARATCTGCAAAGGGGTT 


GACTTAAGCTTTTTTYCTTTTTCCATATAAT


GACACAATCAAGACTSACAGTAGCCTCAACC 


GGACTACAGGCATGTSACACCACACCTGGTT 


AAGGCTCTTGCCCATRTATTCCCGTCTCTCC 


TTTTTTAGTAGAAGCRGGAACAGTTGTCAAT 


GAAGACTCACCAGAASAGGGTGGGGTGGGGA


GAATAAACATCTCACRAACTGTCGCTCCTAG 


TGACAAGAACACATAMAAATATTGAAATTAT 
AAATCTTQTCTCTTCWTQCTAGAAAQAQATQ


ATATTGGAATTTCTAMAGAGACCCATGGTCT 


ATGGGCTGAGACTGtYTGTCTGGTAGATGCA 


TTGTTGGATAAAAGGRCATTGTTTTTCATTA 


TAGCTTGTCTTCAAARGACAGAGAAATAAGA 


AGCTTGACCTTAGGTYAATATTTCATTTGGG 


CCCCACTAATACAACYGAGAACCACTGACTT 


AAAAAGAAGACATTTRTTCAGAGAAAACTGT 


ATTGAACAGTTACCAYAAGCAAGAGAGTGAG 


AAAAACTCAGCGAAGYGAAAAGGTGGATAGC


TATATTCAGACAATCRAATATTACTTAGCAC 


AGCAGAAAGAAAACCWAGACAAAAAGATGTT 


TCTAGAGACTGGGGAMTGGAATCTAACTGCG 


CAGATCACAAAAAGCRTGCACAAAAAAGTAC 


GAGCCAAGCATCCATKCCATCATCTAGTAAC 


tCTGGAGACAACACAKAAATCTATTAATATT 


TTTCACTTTTAAAACWTAAAAAACTACTCTT 


TGAAACACATCCGTARGTATGACATCATTTC 


ACCTATCTGCCCATGSTTTACAGCCTTTTAA 


ATTTTTATTCTATTGMATTATAAGAAAAGTG 
AAGTGCTGGATATACYTGGCTTGCACCGGAC


ATACTTGGCTTGCACYGGACAeCTTTTACGG 


GGACACTGCAGTGATYAGGGGCAGGTGTGGG 


ACTATAAAAGTGCTTYAAAATGCAGCAGCAG 


TTTAAAATGCAGCAGSAGGAGATGTGAAGAC 


AGGAGATGTGAAGACMCAAATGAACAAGTGC 


CAAATGAACAAGTGCRTAGTGACACATAGCT 


GGATGGCTGAGGGAGRGAACAGAGGAAGCGC


CAGCCTAGATATAGGSAQTAACAAATCCTCC


ACCCTTTTCTTTCTCRTACAAGGTTAAGAGC 


AAGGCACACGGGGAARGGGTCAAGGCAGGCT 


AGGCACACGGGGAAGRGGTCAAGGCAGGCTG 


AACTAGGCCTCAGGTRCCCATTAAGCATGCT 


ATACATCCAAAACTTYAGTTAGCAGCAAGCA 


AGGTGACTTGGAAAASGAGATTCACATACTT 


CTTCTCTTCTGTAGAYGTCTCCATGTTACAQ 


TTTTAACACAGCCATRTTACAAACATTGTCA 

II. Analysis of Polymorphisms

A. Preparation of Samples

Polymorphisms are detected in a target nucleic acid from an individual 
being analyzed. For assay of genomic DNA, virtually any biological sample (other
than pure red blood cells) is suitable. For example, convenient tissue samples include
whole blood, semen, saliva, tears, urine, fecal material, sweat, buccal, skin and hair. 
For assay of cDNA or mRNA, the tissue sample must be obtained from an organ in 
which the target nucleic acid is expressed. For example, if the target nucleic acid is 
a cytochrome P450, the liver is a suitable source.

Many of the methods described below require amplification of DNA
from target samples. This can be accomplished by e.g., PCR. See generally PCR
Technology: Principles and Applications for DNA Amplification (ed. H.A. Erlich,
Freeman Press, NY, NY, 1992); PCR Protocols: A Guide to Methods and Applications
(eds. Innis, et al., Academic Press, San Diego, CA, 1990); Mattila et al., Nucleic
Acids Res. 19, 4967 (1991); Eckert et al., PCR Methods and Applications 1, 17 (1991);
PCR (eds. McPherson et al., IRL Press, Oxford); and U.S. Patent 4,683,202 (each of
which is incorporated by reference for all purposes). 

Other suitable amplification methods include the ligase chain reaction
(LCR) (see Wu and Wallace, Genomics 4, 560 (1989); Landegren et al., Science 241,
1077 (1988)), transcription amplification (Kwoh et al., Proc. Natl. Acad. Sci. USA 86,
1173 (1989)), self-sustained sequence replication (Guatelli et al., Proc. Natl. Acad.
Sci. USA 87, 1874 (1990)) and nucleic acid based sequence amplification (NASBA).
The latter two amplification methods involve isothermal reactions based on isothermal
transcription, which produce both single stranded RNA (ssRNA) and double stranded
DNA (dsDNA) as the amplification products in a ratio of about 30 or 100 to 1,
respectively.



B. Detection of Polymorphisms in Target DNA 

There are two distinct types of analysis depending on whether a
polymorphism in question has already been characterized. The first type of analysis
is sometimes referred to as de novo characterization. This analysis compares target
sequences in different individuals to identify points of variation, i.e., polymorphic
sites. By analyzing groups of individuals representing the greatest ethnic diversity
among humans and greatest breed and species variety in plants and animals, patterns
characteristic of the most common alleles/haplotypes of the locus can be identified, and
the frequencies of such alleles/haplotypes in the population determined. Additional allelic
frequencies can be determined for subpopulations characterized by criteria such as
geography, race, or gender. The de novo identification of the polymorphisms of the
invention is described in the Examples section. The second type of analysis is
determining which form(s) of a characterized polymorphism are present in individuals
under test. There are a variety of suitable procedures, which are discussed in turn.
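
The comparison step of de novo characterization can be sketched as follows, assuming the target sequences have already been aligned to equal length. The function name, the toy sequences and the 1% default threshold are illustrative assumptions, not the procedure of the Examples section.

```python
from collections import Counter


def find_polymorphic_sites(sequences, min_freq=0.01):
    """Map each 1-based polymorphic position to its observed base frequencies.

    sequences: equal-length, pre-aligned strings, one per sampled chromosome.
    A position is reported when at least two bases each exceed min_freq.
    """
    if not sequences:
        raise ValueError("no sequences supplied")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        counts = Counter(seq[pos].upper() for seq in sequences)
        freqs = {base: n / len(sequences) for base, n in counts.items()}
        if sum(1 for f in freqs.values() if f > min_freq) >= 2:
            sites[pos + 1] = freqs
    return sites


if __name__ == "__main__":
    sample = ["ACGTA", "ACGTA", "ACATA", "ACATA"]  # hypothetical aligned reads
    print(find_polymorphic_sites(sample))
```

Passing only the sequences from one subpopulation gives subpopulation-specific frequencies with the same function.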

1. Allele-Specific Probes 

The design and use of allele-specific probes for analyzing
polymorphisms is described by e.g., Saiki et al., Nature 324, 163-166 (1986);
Dattagupta, EP 235,726; Saiki, WO 89/11548. Allele-specific probes can be designed
that hybridize to a segment of target DNA from one individual but do not hybridize to
the corresponding segment from another individual due to the presence of different
polymorphic forms in the respective segments from the two individuals. Hybridization
conditions should be sufficiently stringent that there is a significant difference in
hybridization intensity between alleles, and preferably an essentially binary response,
whereby a probe hybridizes to only one of the alleles. Some probes are designed to
hybridize to a segment of target DNA such that the polymorphic site aligns with a
central position (e.g., in a 15 mer at the 7 position; in a 16 mer, at either the 8 or 9
position) of the probe. This design of probe achieves good discrimination in
hybridization between different allelic forms.

Allele-specific probes are often used in pairs, one member of a pair 
showing a perfect match to a reference form of a target sequence and the other member 
showing a perfect match to a variant form. Several pairs of probes can then be 
immobilized on the same support for simultaneous analysis of multiple polymorphisms
within the same target sequence.
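
A pair of allele-specific probes of the kind described above can be sketched as follows. The function names are hypothetical, the flanking sequence is taken from one of the listed sequence tags, and the choice of which base is the reference form is assumed for illustration. The default places the site at position 7 of a 15 mer, reading the text's "7 position" as 1-based (an assumption). The sequences returned are on the same strand as the listed tag; `reverse_complement` gives the opposite-strand version.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_pair(flank5, ref_base, alt_base, flank3, length=15, site_pos=7):
    """Build (reference probe, variant probe) differing only at the polymorphic site.

    site_pos is the 1-based position of the polymorphic site within each probe.
    """
    if not 1 <= site_pos <= length:
        raise ValueError("site_pos must fall within the probe")
    n5 = site_pos - 1        # bases taken from the 5' flank
    n3 = length - site_pos   # bases taken from the 3' flank
    if len(flank5) < n5 or len(flank3) < n3:
        raise ValueError("flanking sequence too short for requested probe")
    left = flank5[len(flank5) - n5:]
    right = flank3[:n3]
    return left + ref_base + right, left + alt_base + right


if __name__ == "__main__":
    # Flanks from the tag AGAGCCGTCTYCTCAGGTTGC; C as reference is hypothetical.
    ref_probe, alt_probe = probe_pair("AGAGCCGTCT", "C", "T", "CTCAGGTTGC")
    print(ref_probe, alt_probe)
    print(reverse_complement(ref_probe))
```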

2. Tiling Arrays

The polymorphisms can also be identified by hybridization to nucleic
acid arrays, some examples of which are described by WO 95/11995 (incorporated by
reference in its entirety for all purposes). One form of such arrays is described in the
Examples section in connection with de novo identification of polymorphisms. The
same array or a different array can be used for analysis of characterized
polymorphisms. WO 95/11995 also describes subarrays that are optimized for
detection of a variant form of a precharacterized polymorphism. Such a subarray
contains probes designed to be complementary to a second reference sequence, which
is an allelic variant of the first reference sequence. The second group of probes is
designed by the same principles as described in the Examples except that the probes
exhibit complementarity to the second reference sequence. The inclusion of a second
group (or further groups) can be particularly useful for analyzing short subsequences of
the primary reference sequence in which multiple mutations are expected to occur
within a short distance commensurate with the length of the probes (i.e., two or more
mutations within 9 to 21 bases).

3. Allele-Specific Primers 

An allele-specific primer hybridizes to a site on target DNA overlapping
a polymorphism and only primes amplification of an allelic form to which the primer
exhibits perfect complementarity. See Gibbs, Nucleic Acid Res. 17, 2427-2448 (1989).
This primer is used in conjunction with a second primer which hybridizes at a distal
site. Amplification proceeds from the two primers, leading to a detectable product
signifying that the particular allelic form is present. A control is usually performed with
a second pair of primers, one of which shows a single-base mismatch at the
polymorphic site and the other of which exhibits perfect complementarity to a distal
site. The single-base mismatch prevents amplification and no detectable product is
formed. The method works best when the mismatch is included in the 3'-most position
of the oligonucleotide aligned with the polymorphism because this position is most
destabilizing to elongation from the primer. See, e.g., WO 93/22456.
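
As a rough illustration of placing the polymorphic base at the 3'-most position, the following Python sketch builds one forward primer per allele. It is a sketch under stated assumptions: the function name, the 20-base primer length, and the example sequence are hypothetical, and no melting-temperature or specificity checks are attempted.

```python
def allele_specific_primer(target: str, site: int, allele: str, length: int = 20) -> str:
    """Return a forward primer whose 3'-terminal base is the given allele.

    target : sense-strand sequence, 5'->3' (hypothetical example data)
    site   : 0-based index of the polymorphic site in target
    """
    start = site - length + 1
    if start < 0:
        raise ValueError("not enough upstream sequence for a primer of this length")
    # Upstream bases are copied from the target; the last base is the allele.
    return target[start:site] + allele


target = "ACGT" * 10                      # made-up sequence; target[25] is "C"
primers = {a: allele_specific_primer(target, 25, a) for a in "CT"}
```

Here `primers["C"]` matches the example target perfectly, whereas `primers["T"]` differs only at its 3' end and so plays the role of the mismatched control primer.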

4. Direct-Sequencing

The direct analysis of the sequence of polymorphisms of the present 
invention can be accomplished using either the dideoxy chain termination method or 
the Maxam-Gilbert method (see Sambrook et al., Molecular Cloning, A Laboratory
Manual (2nd Ed., CSHP, New York 1989); Zyskind et al., Recombinant DNA
Laboratory Manual, (Acad. Press, 1988)).

5. Denaturing Gradient Gel Electrophoresis 

Amplification products generated using the polymerase chain reaction 
can be analyzed by the use of denaturing gradient gel electrophoresis. Different alleles 
can be identified based on the different sequence-dependent melting properties and
electrophoretic migration of DNA in solution. Erlich, ed., PCR Technology, 
Principles and Applications for DNA Amplification, (W.H. Freeman and Co, New 
York, 1992), Chapter 7. 

6. Single-Strand Conformation Polymorphism Analysis

Alleles of target sequences can be differentiated using single-strand
conformation polymorphism analysis, which identifies base differences by alteration
in electrophoretic migration of single-stranded PCR products, as described in Orita et
al., Proc. Nat. Acad. Sci. 86, 2766-2770 (1989). Amplified PCR products can be
generated as described above, and heated or otherwise denatured, to form
single-stranded amplification products. Single-stranded nucleic acids may refold or form
secondary structures which are partially dependent on the base sequence. The different
electrophoretic mobilities of single-stranded amplification products can be related to
base-sequence differences between alleles of target sequences.



III. Methods of Use

After determining polymorphic form(s) present in an individual at one 
or more polymorphic sites, this information can be used in a number of methods. 

A. Forensics

Determination of which polymorphic forms occupy a set of polymorphic
sites in an individual identifies a set of polymorphic forms that distinguishes the
individual. See generally National Research Council, The Evaluation of Forensic DNA
Evidence (Eds. Pollard et al., National Academy Press, DC, 1996). The more sites
that are analyzed, the lower the probability that the set of polymorphic forms in one
individual is the same as that in an unrelated individual. Preferably, if multiple sites
are analyzed, the sites are unlinked. Thus, polymorphisms of the invention are often
used in conjunction with polymorphisms in distal genes. Preferred polymorphisms for
use in forensics are diallelic because the population frequencies of two polymorphic
forms can usually be determined with greater accuracy than those of multiple
polymorphic forms at multi-allelic loci.

The capacity to identify a distinguishing or unique set of forensic 
markers in an individual is useful for forensic analysis. For example, one can
determine whether a blood sample from a suspect matches a blood or other tissue
sample from a crime scene by determining whether the set of polymorphic forms
occupying selected polymorphic sites is the same in the suspect and the sample. If the
set of polymorphic markers does not match between a suspect and a sample, it can be
concluded (barring experimental error) that the suspect was not the source of the
sample. If the set of markers does match, one can conclude that the DNA from the
suspect is consistent with that found at the crime scene. If frequencies of the
polymorphic forms at the loci tested have been determined (e.g., by analysis of a
suitable population of individuals), one can perform a statistical analysis to determine
the probability that a match of suspect and crime scene sample would occur by chance.

p(ID) is the probability that two random individuals have the same
polymorphic or allelic form at a given polymorphic site. In diallelic loci, four
genotypes are possible: AA, AB, BA, and BB. If alleles A and B occur in a haploid
genome of the organism with frequencies x and y, the probability of each genotype in
a diploid organism is (see WO 95/12607):

Homozygote: p(AA) = x^2

Homozygote: p(BB) = y^2 = (1-x)^2

Single Heterozygote: p(AB) = p(BA) = xy = x(1-x)

Both Heterozygotes: p(AB+BA) = 2xy = 2x(1-x)

The probability of identity at one locus (i.e., the probability that two
individuals, picked at random from a population, will have identical polymorphic forms
at a given locus) is given by the equation:

p(ID) = (x^2)^2 + (2xy)^2 + (y^2)^2.

These calculations can be extended for any number of polymorphic
forms at a given locus. For example, the probability of identity p(ID) for a 3-allele
system where the alleles have the frequencies in the population of x, y and z,
respectively, is equal to the sum of the squares of the genotype frequencies:

p(ID) = x^4 + (2xy)^2 + (2yz)^2 + (2xz)^2 + z^4 + y^4

In a locus of n alleles, the appropriate binomial expansion is used to
calculate p(ID) and p(exc).

The cumulative probability of identity (cum p(ID)) for each of multiple
unlinked loci is determined by multiplying the probabilities provided by each locus:

cum p(ID) = p(ID1) p(ID2) p(ID3) ... p(IDn)

The cumulative probability of non-identity for n loci (i.e., the probability
that two random individuals will be different at 1 or more loci) is given by the
equation:

cum p(nonID) = 1 - cum p(ID).

If several polymorphic loci are tested, the cumulative probability of
non-identity for random individuals becomes very high (e.g., one billion to one). Such
probabilities can be taken into account together with other evidence in determining the
guilt or innocence of the suspect.
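
The identity calculations above can be worked through numerically with a short Python sketch. It assumes that genotype frequencies follow directly from the allele frequencies as in the equations given (x^2 for a homozygote, 2xy for the combined heterozygotes) and that loci are unlinked; the function names and the example frequencies are illustrative and are not taken from Table 1.

```python
from itertools import combinations_with_replacement
from math import prod


def genotype_frequencies(allele_freqs):
    """Map each unordered genotype (i, j) to its expected frequency."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    indexed = list(enumerate(allele_freqs))
    freqs = {}
    for (i, p), (j, q) in combinations_with_replacement(indexed, 2):
        # Homozygote: p^2; combined heterozygotes (AB + BA): 2pq.
        freqs[(i, j)] = p * p if i == j else 2 * p * q
    return freqs


def p_identity(allele_freqs):
    """p(ID): sum of squared genotype frequencies at one locus."""
    return sum(f * f for f in genotype_frequencies(allele_freqs).values())


def cumulative_p_identity(loci):
    """cum p(ID): product of the per-locus p(ID) values for unlinked loci."""
    return prod(p_identity(freqs) for freqs in loci)


one_locus = p_identity([0.5, 0.5])                  # (0.25)^2 + (0.5)^2 + (0.25)^2
ten_loci = cumulative_p_identity([[0.5, 0.5]] * 10)
non_identity = 1 - ten_loci                         # cum p(nonID)
```

With two equally frequent alleles, a single locus gives p(ID) = 0.375, and each additional unlinked locus multiplies the cumulative value by the same factor.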

B. Paternity Testing 

The object of paternity testing is usually to determine whether a male 
is the father of a child. In most cases, the mother of the child is known and thus, the 
mother's contribution to the child's genotype can be traced. Paternity testing
investigates whether the part of the child's genotype not attributable to the mother is
consistent with that of the putative father. Paternity testing can be performed by 
analyzing sets of polymorphisms in the putative father and the child. 

If the set of polymorphisms in the child attributable to the father does
not match the putative father, it can be concluded, barring experimental error, that the
putative father is not the real father. If the set of polymorphisms in the child 
attributable to the father does match the set of polymorphisms of the putative father, 
a statistical calculation can be performed to determine the probability of coincidental 
match. 

The probability of parentage exclusion (representing the probability that
a random male will have a polymorphic form at a given polymorphic site that makes
him incompatible as the father) is given by the equation (see WO 95/12607):

p(exc) = xy(1-xy)

where x and y are the population frequencies of alleles A and B of a diallelic
polymorphic site.

(At a triallelic site, p(exc) = xy(1-xy) + yz(1-yz) + xz(1-xz) + 3xyz(1-xyz),
where x, y and z are the respective population frequencies of alleles A, B and
C.)

The probability of non-exclusion is

p(non-exc) = 1 - p(exc)

The cumulative probability of non-exclusion (representing the value
obtained when n loci are used) is thus:

cum p(non-exc) = p(non-exc1) p(non-exc2) p(non-exc3) ... p(non-excn)

The cumulative probability of exclusion for n loci (representing the
probability that a random male will be excluded) is

cum p(exc) = 1 - cum p(non-exc).

If several polymorphic loci are included in the analysis, the cumulative 
probability of exclusion of a random male is very high. This probability can be taken 
into account in assessing the liability of a putative father whose polymorphic marker
set matches the child's polymorphic marker set attributable to his/her father.
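
The exclusion calculations can be sketched in Python as follows. The sketch uses the diallelic expression p(exc) = xy(1-xy) and reads the triallelic expression as xy(1-xy) + yz(1-yz) + xz(1-xz) + 3xyz(1-xyz); that reading, the function names, and the example frequencies are assumptions made for illustration.

```python
from math import prod


def p_exclusion(allele_freqs):
    """Per-locus probability of exclusion for a diallelic or triallelic site."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    if len(allele_freqs) == 2:
        x, y = allele_freqs
        return x * y * (1 - x * y)
    if len(allele_freqs) == 3:
        x, y, z = allele_freqs
        pairwise = sum(a * b * (1 - a * b) for a, b in ((x, y), (y, z), (x, z)))
        return pairwise + 3 * x * y * z * (1 - x * y * z)
    raise ValueError("only diallelic and triallelic sites are handled in this sketch")


def cumulative_p_exclusion(loci):
    """cum p(exc) = 1 - product over loci of (1 - p(exc))."""
    return 1 - prod(1 - p_exclusion(freqs) for freqs in loci)


single = p_exclusion([0.5, 0.5])                    # 0.25 * 0.75
ten_loci = cumulative_p_exclusion([[0.5, 0.5]] * 10)
```

Each additional unlinked locus lowers the cumulative probability of non-exclusion, so `ten_loci` is considerably larger than `single`.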

C. Correlation of Polymorphisms with Phenotypic Traits

The polymorphisms of the invention may contribute to the phenotype of 
an organism in different ways. Some polymorphisms occur within a protein coding 
sequence and contribute to phenotype by affecting protein structure. The effect may 
be neutral, beneficial or detrimental, or both beneficial and detrimental, depending on
the circumstances. For example, a heterozygous sickle cell mutation confers resistance 
to malaria, but a homozygous sickle cell mutation is usually lethal. Other 
polymorphisms occur in noncoding regions but may exert phenotypic effects indirectly 
via influence on replication, transcription, and translation. A single polymorphism 
may affect more than one phenotypic trait. Likewise, a single phenotypic trait may be
affected by polymorphisms in different genes. Further, some polymorphisms 
predispose an individual to a distinct mutation that is causally related to a certain 
phenotype. 

Phenotypic traits include diseases that have known but hitherto
unmapped genetic components (e.g., agammaglobulinemia, diabetes insipidus,
Lesch-Nyhan syndrome, muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease,
familial hypercholesterolemia, polycystic kidney disease, hereditary spherocytosis, von
Willebrand's disease, tuberous sclerosis, hereditary hemorrhagic telangiectasia, familial
colonic polyposis, Ehlers-Danlos syndrome, osteogenesis imperfecta, and acute
intermittent porphyria). Phenotypic traits also include symptoms of, or susceptibility
to, multifactorial diseases of which a component is or may be genetic, such as
autoimmune diseases, inflammation, cancer, diseases of the nervous system, and
infection by pathogenic microorganisms. Some examples of autoimmune diseases
include rheumatoid arthritis, multiple sclerosis, diabetes (insulin-dependent and
non-insulin-dependent), systemic lupus erythematosus and Graves' disease. Some examples of
cancers include cancers of the bladder, brain, breast, colon, esophagus, kidney,
leukemia, liver, lung, oral cavity, ovary, pancreas, prostate, skin, stomach and uterus.
Phenotypic traits also include characteristics such as longevity, appearance (e.g.,
baldness, obesity), strength, speed, endurance, fertility, and susceptibility or
receptivity to particular drugs or therapeutic treatments.

Correlation is performed for a population of individuals who have been
tested for the presence or absence of a phenotypic trait of interest and for polymorphic
marker sets. To perform such analysis, the presence or absence of a set of
polymorphisms (i.e., a polymorphic set) is determined for a set of the individuals, some
of whom exhibit a particular trait, and some of whom lack the trait. The
alleles of each polymorphism of the set are then reviewed to determine whether the
presence or absence of a particular allele is associated with the trait of interest.
Correlation can be performed by standard statistical methods such as a chi-squared test,
and statistically significant correlations between polymorphic form(s) and phenotypic
characteristics are noted. For example, it might be found that the presence of allele
A1 at polymorphism A correlates with heart disease. As a further example, it might
be found that the combined presence of allele A1 at polymorphism A and allele B1 at
polymorphism B correlates with increased milk production of a farm animal.
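
A minimal sketch of such a test, for the simplest case of one allele and one trait, is given below in Python. It assumes a 2x2 table of counts (allele present or absent against trait present or absent) and a Pearson chi-squared statistic with one degree of freedom and no continuity correction; the function name and the counts are invented for illustration.

```python
from math import erfc, sqrt


def chi_squared_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared test on a 2x2 table of counts.

                     trait present   trait absent
    allele present        a               b
    allele absent         c               d

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column must contain at least one count")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 30 of 50 carriers show the trait, 10 of 50 non-carriers do.
statistic, p_value = chi_squared_2x2(30, 20, 10, 40)
```

A small p-value suggests an association between the allele and the trait in the sampled population; it does not by itself show that the polymorphism contributes to the trait.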

Such correlations can be exploited in several ways. In the case 
of a strong correlation between a set of one or more polymorphic forms and a disease 
for which treatment is available, detection of the polymorphic form set in a human or
animal patient may justify immediate administration of treatment, or at least the 
institution of regular monitoring of the patient. Detection of a polymorphic form 
correlated with serious disease in a couple contemplating a family may also be valuable 
to the couple in their reproductive decisions. For example, the female partner might 
elect to undergo in vitro fertilization to avoid the possibility of transmitting such a
polymorphism from her husband to her offspring. In the case of a weaker, but still 
statistically significant correlation between a polymorphic set and human disease, 
immediate therapeutic intervention or monitoring may not be justified. Nevertheless, 
the patient can be motivated to begin simple life-style changes (e.g., diet, exercise) that 
can be accomplished at little cost to the patient but confer potential benefits in reducing
the risk of conditions to which the patient may have increased susceptibility by virtue 
of variant alleles. Identification of a polymorphic set in a patient correlated with 
enhanced receptiveness to one of several treatment regimes for a disease indicates that 
this treatment regime should be followed. 

For animals and plants, correlations between characteristics and
phenotype are useful for breeding for desired characteristics. For example, Beitz et
al., US 5,292,639 discuss use of bovine mitochondrial polymorphisms in a breeding
program to improve milk production in cows. To evaluate the effect of mtDNA D-loop
sequence polymorphism on milk production, each cow was assigned a value of 1
if variant or 0 if wildtype with respect to a prototypical mitochondrial DNA sequence
at each of 17 locations considered. Each production trait was analyzed individually
with the following animal model:

Y_ijkpn = μ + YS_i + P_j + X_k + β_1 + ... + β_17 + PE_n + a_n + e_p

where Y_ijkpn is the milk, fat, fat percentage, SNF, SNF percentage, energy
concentration, or lactation energy record; μ is an overall mean; YS_i is the effect
common to all cows calving in year-season; X_k is the effect common to cows in either
the high or average selection line; β_1 to β_17 are the binomial regressions of production
record on mtDNA D-loop sequence polymorphisms; PE_n is permanent environmental
effect common to all records of cow n; a_n is effect of animal n and is composed of the
additive genetic contribution of sire and dam breeding values and a Mendelian
sampling effect; and e_p is a random residual. It was found that eleven of seventeen
polymorphisms tested influenced at least one production trait. Bovines having the best
polymorphic forms for milk production at these eleven loci are used as parents for
breeding the next generation of the herd.

D. Genetic Mapping of Phenotypic Traits

The previous section concerns identifying correlations between 
phenotypic traits and polymorphisms that directly or indirectly contribute to those 
traits. The present section describes identification of a physical linkage between a
genetic locus associated with a trait of interest and polymorphic markers that are not
associated with the trait, but are in physical proximity with the genetic locus
responsible for the trait and co-segregate with it. Such analysis is useful for mapping
a genetic locus associated with a phenotypic trait to a chromosomal position, and
thereby cloning gene(s) responsible for the trait. See Lander et al., Proc. Natl. Acad.
Sci. (USA) 83, 7353-7357 (1986); Lander et al., Proc. Natl. Acad. Sci. (USA) 84,
2363-2367 (1987); Donis-Keller et al., Cell 51, 319-337 (1987); Lander et al.,
Genetics 121, 185-199 (1989). Genes localized by linkage can be cloned by a process
known as directional cloning. See Wainwright, Med. J. Australia 159, 170-174
(1993); Collins, Nature Genetics 1, 3-6 (1992) (each of which is incorporated by
reference in its entirety for all purposes).

Linkage studies are typically performed on members of a family.
Available members of the family are characterized for the presence or absence of a
phenotypic trait and for a set of polymorphic markers. The distribution of polymorphic
markers in an informative meiosis is then analyzed to determine which polymorphic
markers co-segregate with a phenotypic trait. See, e.g., Kerem et al., Science 245,
1073-1080 (1989); Monaco et al., Nature 316, 842 (1985); Yamoka et al., Neurology
40, 222-226 (1990); Rossiter et al., FASEB Journal 5, 21-27 (1991).

Linkage is analyzed by calculation of LOD (log of the odds) values. A lod value is the
relative likelihood of obtaining observed segregation data for a marker and a genetic
locus when the two are located at a recombination fraction θ, versus the situation in
which the two are not linked, and thus segregating independently (Thompson &
Thompson, Genetics in Medicine (5th ed, W.B. Saunders Company, Philadelphia,
1991); Strachan, "Mapping the human genome" in The Human Genome (BIOS
Scientific Publishers Ltd, Oxford), Chapter 4). A series of likelihood ratios are
calculated at various recombination fractions (θ), ranging from θ = 0.0 (coincident
loci) to θ = 0.50 (unlinked). Thus, the likelihood ratio at a given value of θ is: probability
of data if loci linked at θ to probability of data if loci unlinked. The computed
likelihoods are usually expressed as the log10 of this ratio (i.e., a lod score). For
example, a lod score of 3 indicates 1000:1 odds against an apparent observed linkage
being a coincidence. The use of logarithms allows data collected from different
families to be combined by simple addition. Computer programs are available for the
calculation of lod scores for differing values of θ (e.g., LIPED, MLINK (Lathrop,
Proc. Nat. Acad. Sci. (USA) 81, 3443-3446 (1984))). For any particular lod score, a
recombination fraction may be determined from mathematical tables. See Smith et al.,
Mathematical tables for research workers in human genetics (Churchill, London,
1961); Smith, Ann. Hum. Genet. 32, 127-150 (1968). The value of θ at which the lod
score is the highest is considered to be the best estimate of the recombination fraction.

Positive lod score values suggest that the two loci are linked, whereas
negative values suggest that linkage is less likely (at that value of θ) than the possibility
that the two loci are unlinked. By convention, a combined lod score of +3 or greater
(equivalent to greater than 1000:1 odds in favor of linkage) is considered definitive
evidence that two loci are linked. Similarly, by convention, a negative lod score of -2
or less is taken as definitive evidence against linkage of the two loci being compared.
Negative linkage data are useful in excluding a chromosome or a segment thereof from
consideration. The search focuses on the remaining non-excluded chromosomal
locations.
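
The lod calculation can be illustrated with a short Python sketch for the simplest textbook case. It assumes phase-known, fully informative meioses, so that the likelihood at recombination fraction θ is θ^r (1-θ)^(n-r) for r recombinants in n meioses and the unlinked likelihood is 0.5^n; programs such as LIPED and MLINK handle far more general pedigrees. The function names, grid spacing, and counts are illustrative assumptions.

```python
from math import log10


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    r, n = recombinants, meioses
    if theta == 0.0:
        # Any observed recombinant is impossible at theta = 0.
        return float("-inf") if r > 0 else n * log10(2)
    return r * log10(theta) + (n - r) * log10(1 - theta) - n * log10(0.5)


def best_theta(recombinants: int, meioses: int, steps: int = 50) -> float:
    """Grid search over theta = 0.00 ... 0.50 for the highest lod score."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: lod_score(recombinants, meioses, t))


# Hypothetical data: two families scored at the same theta are combined by addition.
family_1 = lod_score(1, 10, 0.1)
family_2 = lod_score(0, 8, 0.1)
combined = family_1 + family_2
theta_hat = best_theta(1, 10)
```

In this example neither family alone reaches the conventional threshold of +3, but their combined score at θ = 0.1 does.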

IV. Modified Polypeptides and Gene Sequences

The invention further provides variant forms of nucleic acids and
corresponding proteins. The nucleic acids comprise one of the sequences described in
Table 1, column 8, in which the polymorphic position is occupied by one of the
alternative bases for that position. Some nucleic acids encode full-length variant forms
of proteins. Similarly, variant proteins have the prototypical amino acid sequences
encoded by the nucleic acid sequences shown in Table 1, column 8 (read so as to be
in-frame with the full-length coding sequence of which it is a component), except at an
amino acid encoded by a codon including one of the polymorphic positions shown in
the Table. That position is occupied by the amino acid coded by the corresponding
codon in any of the alternative forms shown in the Table.

Variant genes can be expressed in an expression vector in which a variant gene
is operably linked to a native or other promoter. Usually, the promoter is a eukaryotic
promoter for expression in a mammalian cell. The transcription regulation sequences
typically include a heterologous promoter and optionally an enhancer which is
recognized by the host. The selection of an appropriate promoter, for example trp, lac,
phage promoters, glycolytic enzyme promoters and tRNA promoters, depends on the
host selected. Commercially available expression vectors can be used. Vectors can
include host-recognized replication systems, amplifiable genes, selectable markers, host
sequences useful for insertion into the host genome, and the like.

The means of introducing the expression construct into a host cell varies
depending upon the particular construction and the target host. Suitable means include
fusion, conjugation, transfection, transduction, electroporation or injection, as
described in Sambrook, supra. A wide variety of host cells can be employed for
expression of the variant gene, both prokaryotic and eukaryotic. Suitable host cells
include bacteria such as E. coli, yeast, filamentous fungi, insect cells, mammalian
cells, typically immortalized, e.g., mouse, CHO, human and monkey cell lines and
derivatives thereof. Preferred host cells are able to process the variant gene product
to produce an appropriate mature polypeptide. Processing includes glycosylation,
ubiquitination, disulfide bond formation, general post-translational modification, and
the like.

The protein may be isolated by conventional means of protein
biochemistry and purification to obtain a substantially pure product, i.e., 80, 95 or
99% free of cell component contaminants, as described in Jacoby, Methods in
Enzymology Volume 104, Academic Press, New York (1984); Scopes, Protein
Purification, Principles and Practice, 2nd Edition, Springer-Verlag, New York (1987);
and Deutscher (ed), Guide to Protein Purification, Methods in Enzymology, Vol. 182
(1990). If the protein is secreted, it can be isolated from the supernatant in which the
host cell is grown. If not secreted, the protein can be isolated from a lysate of the host
cells.

The invention further provides transgenic nonhuman animals capable of 
expressing an exogenous variant gene and/or having one or both alleles of an 
endogenous variant gene inactivated. Expression of an exogenous variant gene is
usually achieved by operably linking the gene to a promoter and optionally an
enhancer, and microinjecting the construct into a zygote. See Hogan et al., 
"Manipulating the Mouse Embryo, A Laboratory Manual," Cold Spring Harbor 
Laboratory. Inactivation of endogenous variant genes can be achieved by forming a 
transgene in which a cloned variant gene is inactivated by insertion of a positive
selection marker. See Capecchi, Science 244, 1288-1292 (1989). The transgene is then
introduced into an embryonic stem cell, where it undergoes homologous recombination
with an endogenous variant gene. Mice and other rodents are preferred animals. Such 
animals provide useful drug screening systems. 

In addition to substantially full-length polypeptides expressed by variant
genes, the present invention includes biologically active fragments of the polypeptides,
or analogs thereof, including organic molecules which simulate the interactions of the 
peptides. Biologically active fragments include any portion of the full-length 
polypeptide which confers a biological function on the variant gene product, including 
ligand binding, and antibody binding. Ligand binding includes binding by nucleic
acids, proteins or polypeptides, small biologically active molecules, or large cellular
structures. 

Polyclonal and/or monoclonal antibodies that specifically bind to variant 
gene products but not to corresponding prototypical gene products are also provided. 
Antibodies can be made by injecting mice or other animals with the variant gene
product or synthetic peptide fragments thereof. Monoclonal antibodies are screened
as described, for example, in Harlow & Lane, Antibodies, A Laboratory Manual,
Cold Spring Harbor Press, New York (1988); Goding, Monoclonal antibodies,
Principles and Practice (2d ed.) Academic Press, New York (1986). Monoclonal
antibodies are tested for specific immunoreactivity with a variant gene product and lack
of immunoreactivity to the corresponding prototypical gene product. These antibodies
are useful in diagnostic assays for detection of the variant form, or as an active 
ingredient in a pharmaceutical composition. 

V. Kits 

The invention further provides kits comprising at least one allele-specific
oligonucleotide as described above. Often, the kits contain one or more pairs of
allele-specific oligonucleotides hybridizing to different forms of a polymorphism. In some
kits, the allele-specific oligonucleotides are provided immobilized to a substrate. For
example, the same substrate can comprise allele-specific oligonucleotide probes for
detecting at least 10, 100 or all of the polymorphisms shown in Table 1. Optional
additional components of the kit include, for example, restriction enzymes, reverse
transcriptase or polymerase, the substrate nucleoside triphosphates, means used to label
(for example, an avidin-enzyme conjugate and enzyme substrate and chromogen if the
label is biotin), and the appropriate buffers for reverse transcription, PCR, or
hybridization reactions. Usually, the kit also contains instructions for carrying out the
methods.

EXAMPLES 

The polymorphisms shown in Table 1 were identified by resequencing 
of target sequences from eight unrelated individuals of diverse ethnic and geographic 
backgrounds by hybridization to probes immobilized to microfabricated arrays. The
strategy and principles for design and use of such arrays are generally described in WO
95/11995. The strategy provides arrays of probes for analysis of target sequences
showing a high degree of sequence identity to the reference sequences of the fragments
shown in Table 1, column 1. The reference sequences were sequence-tagged sites
(STSs) developed in the course of the Human Genome Project (see, e.g., Science 270,
1945-1954 (1995); Nature 380, 152-154 (1996)). Most STSs ranged from 100 bp to
300 bp in size.

A typical probe array used in this analysis has two groups of four sets
of probes that respectively tile both strands of a reference sequence. A first probe set
comprises a plurality of probes exhibiting perfect complementarity with one of the
reference sequences. Each probe in the first probe set has an interrogation position that
corresponds to a nucleotide in the reference sequence. That is, the interrogation
position is aligned with the corresponding nucleotide in the reference sequence, when
the probe and reference sequence are aligned to maximize complementarity between
the two. For each probe in the first set, there are three corresponding probes from
three additional probe sets. Thus, there are four probes corresponding to each
nucleotide in the reference sequence. The probes from the three additional probe sets
are identical to the corresponding probe from the first probe set except at the
interrogation position, which occurs in the same position in each of the four
corresponding probes from the four probe sets, and is occupied by a different
nucleotide in the four probe sets. In the present analysis, probes were 25 nucleotides
long. Arrays tiled for multiple different reference sequences were included on the
same substrate.
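
The four-probe tiling scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the design procedure of WO 95/11995: the interrogation position is placed at the center of an odd-length probe, only one strand is tiled, positions too close to the ends of the reference are skipped, and the function names and example sequence are hypothetical.

```python
BASES = "ACGT"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tile_reference(reference: str, probe_length: int = 25) -> dict:
    """Return {position: {target_base: probe}} with four probes per position.

    The probe keyed by target_base is perfectly complementary to the reference
    window with target_base substituted at the interrogated position, so the
    probe keyed by the reference base itself belongs to the first probe set.
    """
    if probe_length % 2 == 0:
        raise ValueError("this sketch assumes an odd probe length")
    half = probe_length // 2
    tiles = {}
    for pos in range(half, len(reference) - half):
        window = reference[pos - half:pos + half + 1]
        tiles[pos] = {
            base: reverse_complement(window[:half] + base + window[half + 1:])
            for base in BASES
        }
    return tiles


reference = "ACGTTGCAAGGCTTAACCGGTTAGCATCGATCGGA"   # made-up 35-base sequence
tiles = tile_reference(reference)
```

Each entry of `tiles` holds four probes that differ only at the central interrogation position, one of which is the perfect match to the reference.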

Multiple target sequences from an individual were amplified from human
genomic DNA using primers for the fragments indicated in the listed Web sites. The
amplified target sequences were fluorescently labelled during or after PCR. The
labelled target sequences were hybridized with a substrate bearing immobilized arrays
of probes. The amount of label bound to probes was measured. Analysis of the
pattern of label revealed the nature and position of differences between the target and
reference sequence. For example, comparison of the intensities of four corresponding
probes reveals the identity of a corresponding nucleotide in the target sequences aligned
with the interrogation position of the probes. The corresponding nucleotide is the
complement of the nucleotide occupying the interrogation position of the probe
showing the highest intensity (see WO 95/11995). The existence of a polymorphism
is also manifested by differences in normalized hybridization intensities of probes
flanking the polymorphism when the probes hybridized to corresponding targets from
different individuals. For example, relative loss of hybridization intensity in a
"footprint" of probes flanking a polymorphism signals a difference between the target
and reference (i.e., a polymorphism) (see EP 717,113, incorporated by reference in its
entirety for all purposes). Additionally, hybridization intensities for corresponding
targets from different individuals can be classified into groups or clusters suggested by
the data, not defined a priori, such that isolates in a given cluster tend to be similar and
isolates in different clusters tend to be dissimilar. See WO 97/29212 filed February
7, 1997 (incorporated by reference in its entirety for all purposes). Hybridizations to
samples from different individuals were performed separately. Table 1 summarizes the
data obtained for target sequences in comparison with a reference sequence for the
eight individuals tested.
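
The base-calling rule stated above (the target nucleotide is the complement of the interrogation base of the brightest of four corresponding probes) can be sketched in Python as follows. The intensity values, data layout, and function names are invented for illustration; footprint and cluster analyses are not modelled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(intensities: dict) -> str:
    """Call one target nucleotide from four probe intensities.

    intensities maps the nucleotide at each probe's interrogation position
    to the measured hybridization intensity of that probe.
    """
    brightest = max(intensities, key=intensities.get)
    return COMPLEMENT[brightest]


def call_sequence(rows) -> str:
    """Call a target sequence from one intensity dict per position."""
    return "".join(call_base(row) for row in rows)


def differences(reference: str, called: str):
    """List (position, reference_base, called_base) where the two differ."""
    return [(i, r, c) for i, (r, c) in enumerate(zip(reference, called)) if r != c]


# Hypothetical intensities for four positions of a target read against "ACGT".
rows = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.9},   # brightest probe has T -> target A
    {"A": 0.1, "C": 0.8, "G": 0.2, "T": 0.1},   # brightest probe has C -> target G
    {"A": 0.1, "C": 0.9, "G": 0.1, "T": 0.1},   # brightest probe has C -> target G
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.2},   # brightest probe has A -> target T
]
called = call_sequence(rows)
variants = differences("ACGT", called)
```

In this invented example the second position is called as G where the reference has C, which is the kind of difference that would be examined further as a candidate polymorphism.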

From the foregoing, it is apparent that the invention includes a number 
of general uses that can be expressed concisely as follows. The invention provides for 
the use of any of the nucleic acid segments described above in the diagnosis or 
monitoring of diseases, such as cancer, inflammation, heart disease, diseases of the 
CNS, and susceptibility to infection by microorganisms. The invention further 
provides for the use of any of the nucleic acid segments in the manufacture of a 
medicament for the treatment or prophylaxis of such diseases. The invention further 
provides for the use of any of the DNA segments as a pharmaceutical. 

All publications and patent applications cited above are incorporated by 
reference in their entirety for all purposes to the same extent as if each individual 
publication or patent application were specifically and individually indicated to be so 
incorporated by reference. Although the present invention has been described in some 
detail by way of illustration and example for purposes of clarity and understanding, it 
will be apparent that certain changes and modifications may be practiced within the 
scope of the appended claims. 




WHAT IS CLAIMED IS: 



1. A nucleic acid segment of between 10 and 100 bases from a fragment shown in 
Table 1 including a polymorphic site, or the complement of the segment. 

2. The nucleic acid segment of claim 1 that is DNA. 

3. The nucleic acid segment of claim 1 that is RNA. 

4. The segment of claim 1 that is less than 50 bases. 

5. The segment of claim 1 that is less than 20 bases. 

6. The segment of claim 1, wherein the fragment is 19201 and the polymorphic site 
is at position 179. 

7. The segment of claim 1, wherein the polymorphic site is diallelic. 

8. The segment of claim 1, wherein the polymorphic form occupying the polymorphic 
site is the reference base for the fragment listed in Table 1, column 3. 

9. The segment of claim 1, wherein the polymorphic form occupying the polymorphic 
site is an alternative form for the fragment listed in Table 1, column 5. 

10. An allele-specific oligonucleotide that hybridizes to a segment of a fragment 
shown in Table 1, column 8 or its complement. 

11. The allele-specific oligonucleotide of claim 10 that is a probe. 

12. The allele-specific oligonucleotide of claim 10, wherein a central position of 
the probe aligns with the polymorphic site of the fragment. 

13. The allele-specific oligonucleotide of claim 10 that is a primer. 

14. The allele-specific oligonucleotide of claim 13, wherein the 3' end of the 
primer aligns with the polymorphic site of the fragment. 

15. An isolated nucleic acid comprising a sequence of Table 1, column 8 or the 
complement thereof, wherein the polymorphic site within the sequence or complement 
is occupied by a base other than the reference base shown in Table 1, column 3. 

16. A method of analyzing a nucleic acid, comprising: 
obtaining the nucleic acid from an individual; and 
determining a base occupying any one of the polymorphic sites shown in Table 1. 

17. The method of claim 16, wherein the determining comprises determining a set of 
bases occupying a set of the polymorphic sites shown in Table 1. 

18. The method of claim 16, wherein the nucleic acid is obtained from a plurality 
of individuals, and a base occupying one of the polymorphic positions is determined 
in each of the individuals, and the method further comprising testing each individual 
for the presence of a disease phenotype, and correlating the presence of the disease 
phenotype with the base. 



